Double descent

Neural networks tend to show a “double descent” phenomenon, where the test error decreases, increases, and then decreases again as we keep increasing the number of parameters.

However, this phenomenon is not limited to neural networks. Something similar happens with plain polynomial regression, as the sketch below illustrates.
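
Here is a minimal sketch of that polynomial-regression version, under an assumed toy setup (a noisy sine target, Legendre polynomial features, and a minimum-norm least-squares fit; none of these specifics come from the original text). With this kind of setup, the test error typically rises as the number of parameters approaches the number of training points and falls again past that point, though the exact shape depends on the noise level and feature choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem (hypothetical setup for illustration).
n_train, n_test = 15, 200

def target(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + 0.1 * rng.normal(size=n_train)
x_test = np.linspace(-1, 1, n_test)
y_test = target(x_test)

def features(x, degree):
    # Legendre polynomial features up to `degree` (degree + 1 columns).
    return np.polynomial.legendre.legvander(x, degree)

for degree in [1, 3, 5, 10, 14, 20, 50, 100, 300]:
    Phi_train = features(x_train, degree)
    Phi_test = features(x_test, degree)
    # Minimum-L2-norm least-squares solution via the pseudoinverse.
    # Below the interpolation threshold (degree + 1 <= n_train) this is the
    # ordinary least-squares fit; above it, it picks the interpolating
    # solution with the smallest L2 norm.
    w = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"degree={degree:4d}  params={degree + 1:4d}  test MSE={test_mse:.4f}")
```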

The key is regularization. When the degrees of freedom (the flexibility) of the model exceed the number of data points, there are many solutions that fit the training data perfectly. When we pick the solution selected by regularization (e.g., the parameters with the smallest L2 or L1 norm), we impose a strong constraint on that solution space. As a result, we don't end up with the wildly overfit solutions, but with well-behaved ones that fit the training data and still generalize well to the test data.
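
To make the point about the solution space concrete, here is a small sketch (reusing the hypothetical setup from the previous snippet) that compares the minimum-norm interpolating solution with another interpolating solution obtained by adding a random null-space component. Both fit the training data essentially perfectly, but the unconstrained one typically has a much larger norm and a much worse test error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Overparameterized setting: more Legendre features than training points
# (hypothetical numbers, just to illustrate the argument above).
n_train, degree = 15, 60
x_train = rng.uniform(-1, 1, n_train)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=n_train)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

Phi_train = np.polynomial.legendre.legvander(x_train, degree)
Phi_test = np.polynomial.legendre.legvander(x_test, degree)

# Minimum-L2-norm interpolating solution.
w_min = np.linalg.pinv(Phi_train) @ y_train

# Another interpolating solution: add a random direction from the null
# space of Phi_train, so the training fit is (numerically) unchanged.
_, _, Vt = np.linalg.svd(Phi_train)
null_basis = Vt[n_train:]  # rows spanning the null space of Phi_train
w_wild = w_min + null_basis.T @ rng.normal(size=null_basis.shape[0])

for name, w in [("min-norm", w_min), ("arbitrary interpolant", w_wild)]:
    train_mse = np.mean((Phi_train @ w - y_train) ** 2)
    test_mse = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"{name:22s}  ||w||={np.linalg.norm(w):8.2f}  "
          f"train MSE={train_mse:.2e}  test MSE={test_mse:.2f}")
```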